AITopics | Faroe Islands

Faroe Islands

Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs

arXiv.org Artificial Intelligence, Jan-9-2025

For instance, recent studies (Mykhalevych and Preply, 2024; Kim et al., 2023) have revealed that 50% of Americans and 85% of the Netflix users overall frequently watch TV and streaming video content with subtitles. Studies show that subtitles can enhance understanding and memory retention. A lot of viewers choose to enjoy their content quietly [...]

Both iterative pseudo-labeling and LLM-based post-editing have been an active area of research in the context of verbatim automatic speech recognition (ASR). Pseudo-labeling based semi-supervised learning in ASR has been studied since at least Zavaliagkos et al. (1998) and has been later investigated in several works, e.g. by Veselý et al. (2013); Xu et al. (2020).

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.05234

Country:

Europe > Estonia (0.14)
Europe > Faroe Islands (0.14)

Genre: Research Report > New Finding (0.94)

Industry:

Leisure & Entertainment (1.00)
Media > Television (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

Boughorbel, Sabri, Parvez, MD Rizwan, Hawasly, Majd

arXiv.org Artificial Intelligence, May-23-2024

Training LLMs in low-resource languages usually relies on data augmentation with machine translation (MT) from English. However, translation brings a number of challenges: translating and curating huge amounts of content with high-end machine translation solutions is costly, the translated content carries over cultural biases, and if the translation is not faithful and accurate, the quality of the data degrades, causing issues in the trained model. In this work we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the free NLLB-3B MT model. We train a number of story generation models of sizes 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality stories, representing 1% of the original training data, using a capable LLM in Arabic. We show, using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic issues and cultural bias.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.14277

Country:

Asia > Middle East > Qatar (0.14)
Europe > Italy (0.14)
Europe > Faroe Islands (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Evaluating Large Language Models with Human Feedback: Establishing a Swedish Benchmark

Moell, Birger

arXiv.org Artificial Intelligence, May-22-2024

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have demonstrated significant capabilities across numerous applications. However, the performance of these models in languages with fewer resources, such as Swedish, remains under-explored. This study introduces a comprehensive human benchmark to assess the efficacy of prominent LLMs in understanding and generating Swedish language texts using forced choice ranking. We employ a modified version of the ChatbotArena benchmark, incorporating human feedback to evaluate eleven different models, including GPT-4, GPT-3.5, various Claude and Llama models, and bespoke models like Dolphin-2.9-llama3b-8b-flashback and BeagleCatMunin. These models were chosen based on their performance on LMSYS chatbot arena and the Scandeval benchmarks. We release the chatbotarena.se benchmark as a tool to improve our understanding of language model performance in Swedish with the hopes that it will be widely used. We aim to create a leaderboard once sufficient data has been collected and analysed.
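
A forced-choice comparison yields pairwise votes, and one common way to turn such votes into a leaderboard is an Elo-style rating update. The abstract does not say which rating method the benchmark will use, so the following is only an illustrative sketch: the model names, the K-factor of 32, and the starting rating of 1000 are all assumptions.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(votes, k: float = 32.0, start: float = 1000.0):
    """Apply Elo updates for a sequence of (winner, loser) votes."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # larger gain for an unexpected win
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = update_elo(votes)
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Sequential Elo depends on the order of the votes; with enough data, fitting a Bradley-Terry model over all votes at once is a frequently used alternative.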

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2405.14006

Country: Europe > Faroe Islands (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Ignashina, Mariia, Ive, Julia

arXiv.org Artificial Intelligence, Apr-30-2024

Current text generation models are trained using real data which can potentially contain sensitive information, such as confidential patient information. Under certain conditions, a model can be triggered to output training data it has memorised, exposing sensitive data. To mitigate this risk, we propose a safer alternative in which fragmented data, in the form of domain-specific short phrases randomly grouped together, is shared instead of full texts. Thus, text fragments that could re-identify an individual cannot be reproduced by the model in one sequence, giving significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from the pre-trained knowledge and deliver classification results when fine-tuned with fragmented data comparable to fine-tuning with full training data.
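
The fragmentation idea can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: splitting at punctuation is a naive stand-in for the syntactic chunking the abstract mentions, and the function names, the group size of 3, and the example notes are invented for the sketch.

```python
import random
import re


def fragment(text: str) -> list[str]:
    """Split a text into short phrases at punctuation (naive chunking stand-in)."""
    parts = re.split(r"[.,;:!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def build_fragmented_corpus(texts, group_size=3, seed=0):
    """Pool fragments from all texts, shuffle them, and regroup them at random."""
    pool = [frag for text in texts for frag in fragment(text)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rng.shuffle(pool)
    return [pool[i:i + group_size] for i in range(0, len(pool), group_size)]


# Invented example notes (not real patient data).
notes = [
    "Patient reports chest pain, shortness of breath; history of hypertension.",
    "No chest pain. Mild ankle swelling, on beta blockers.",
]
groups = build_fragmented_corpus(notes)
```

Because fragments from different notes are pooled before grouping, a group no longer reflects a single original record, which is the property the abstract relies on for protection against linkage.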

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2404.19486

Country: Europe > Faroe Islands (0.14)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)

Energy-Efficient Scheduling with Predictions

Balkanski, Eric, Perivier, Noemie, Stein, Clifford, Wei, Hao-Ting

arXiv.org Artificial Intelligence, Feb-26-2024

An important goal of modern scheduling systems is to efficiently manage power usage. In energy-efficient scheduling, the operating system controls the speed at which a machine is processing jobs with the dual objective of minimizing energy consumption and optimizing the quality of service cost of the resulting schedule. Since machine-learned predictions about future requests can often be learned from historical data, a recent line of work on learning-augmented algorithms aims to achieve improved performance guarantees by leveraging predictions. In particular, for energy-efficient scheduling, Bamas et al. (2020) and Antoniadis et al. (2021) designed algorithms with predictions for the energy minimization with deadlines problem and achieved an improved competitive ratio when the prediction error is small while also maintaining worst-case bounds even when the prediction error is arbitrarily large. In this paper, we consider a general setting for energy-efficient scheduling and provide a flexible learning-augmented algorithmic framework that takes as input an offline and an online algorithm for the desired energy-efficient scheduling problem. We show that, when the prediction error is small, this framework gives improved competitive ratios for many different energy-efficient scheduling problems, including energy minimization with deadlines, while also maintaining a bounded competitive ratio regardless of the prediction error. Finally, we empirically demonstrate that this framework achieves improved performance on real and synthetic datasets.

artificial intelligence, opt, planning & scheduling, (17 more...)

arXiv.org Artificial Intelligence

2402.17143

Country: Europe > Faroe Islands (0.14)

Genre: Research Report (0.64)

Industry: Energy (0.68)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.88)

Augmenty: A Python Library for Structured Text Augmentation

Enevoldsen, Kenneth

arXiv.org Artificial Intelligence, Dec-9-2023

Text augmentation is a useful tool for training (Wei and Zou 2019) and evaluating (Ribeiro et al. 2020) natural language processing models and systems. Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility, being confined to basic tasks such as text classification or catering to specific downstream use-cases such as estimating robustness (Goel et al. 2021). To address these constraints, Augmenty provides structured augmentation of text along with its annotations. Augmenty integrates seamlessly with the popular NLP library spaCy (Honnibal et al. 2020) and seeks to be compatible with all models and tasks supported by spaCy. Augmenty provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives that can be used to create custom augmenters such as word replacement augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.
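
To make "augmentation of the text along with its annotations" concrete, here is a minimal plain-Python sketch of a word-replacement augmenter that keeps token-level NER labels aligned. It deliberately does not use Augmenty or spaCy and should not be read as Augmenty's API; the function name, the `level` parameter, and the replacement table are hypothetical.

```python
import random


def replace_words(tokens, labels, replacements, level=1.0, seed=0):
    """Swap tokens for listed alternatives while keeping one label per token."""
    rng = random.Random(seed)
    new_tokens = []
    for token in tokens:
        options = replacements.get(token.lower())
        # Replace with probability `level` when an alternative is available.
        if options and rng.random() < level:
            new_tokens.append(rng.choice(options))
        else:
            new_tokens.append(token)
    # The token count is unchanged, so token-level labels stay aligned.
    return new_tokens, list(labels)


tokens = ["Anna", "visited", "Copenhagen", "yesterday"]
labels = ["B-PER", "O", "B-LOC", "O"]
replacements = {"visited": ["toured"], "yesterday": ["recently"]}
aug_tokens, aug_labels = replace_words(tokens, labels, replacements)
```

Replacing one token with exactly one token is the simplest case; augmentations that insert or delete tokens would also need to shift or re-span the annotations.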

artificial intelligence, natural language, text processing, (13 more...)

arXiv.org Artificial Intelligence

2312.0552

Country:

Europe > Faroe Islands (0.15)
Europe > Croatia (0.15)
Asia > China (0.15)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.71)

Assessing Translation capabilities of Large Language Models involving English and Indian Languages

Mujadia, Vandan, Urlana, Ashok, Bhaskar, Yash, Pavani, Penumalla Aditya, Shravya, Kukkapalli, Krishnamurthy, Parameswari, Sharma, Dipti Misra

arXiv.org Artificial Intelligence, Nov-15-2023

Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best performing large language model for the translation task involving LLMs, which is based on LLaMA. Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for English to Indian languages on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35, and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets. Overall, our findings highlight the potential and strength of large language models for machine translation, including for languages that are currently underrepresented in LLMs.
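
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: character n-gram precision and recall, averaged over n-gram orders and combined into an F-score that weights recall more heavily. It is a simplified illustration under stated assumptions (whitespace removed, orders 1 to 6, beta = 2, a single reference) and not the evaluation tooling used in the paper, so it will not reproduce the reported scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


score = simple_chrf("the cat sat", "the cat sits")
```

Because it works on characters rather than words, chrF gives partial credit for near-miss word forms, which is one reason it is often reported alongside BLEU for morphologically rich languages.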

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2311.09216

Country:

Asia > India (0.14)
Asia > China (0.14)
North America > Canada (0.14)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Constructing a Knowledge Graph from Textual Descriptions of Software Vulnerabilities in the National Vulnerability Database

Høst, Anders Mølmen, Lison, Pierre, Moonen, Leon

arXiv.org Artificial Intelligence, May-15-2023

Knowledge graphs have shown promise for several cybersecurity tasks, such as vulnerability assessment and threat analysis. In this work, we present a new method for constructing a vulnerability knowledge graph from information in the National Vulnerability Database (NVD). Our approach combines named entity recognition (NER), relation extraction (RE), and entity prediction using a combination of neural models, heuristic rules, and knowledge graph embeddings. We demonstrate how our method helps to fix missing entities in knowledge graphs used for cybersecurity and evaluate the performance.

artificial intelligence, natural language, relation, (15 more...)

arXiv.org Artificial Intelligence

2305.00382

Country:

Europe > Norway (0.16)
North America > United States (0.14)
Europe > Faroe Islands (0.14)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Seaweed: The food and fuel of the future?

BBC News, Aug-27-2020, 22:57:55 GMT

Sunshine has given way to wind and rain, as the motorboat chugs through a fjord in the Faroe Islands. "It's a bit windy here," says Olavur Gregarsen. "We'll see how far we can get to the harvesting boat." We soon reach a sheltered spot where steep mountains look down on hundreds of buoys bobbing in the sea. "They are holding up a horizontal line," explains Mr Gregarsen, the managing director of Ocean Rainforest, a seaweed producer.

artificial intelligence, food and fuel, gregarsen, (13 more...)

BBC News

Country:

North America > United States (0.30)
Europe > Faroe Islands (0.25)

Industry:

Energy (0.73)
Food & Agriculture > Agriculture (0.72)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.49)

Technology: Information Technology > Artificial Intelligence (0.31)
